Journal of Chemical Information and Modeling — Latest Matching Preprints

1

Dynamic consensus pocket detection across molecular dynamics ensembles reveals persistent and transient druggable sites

Marigliani, G.; Petrizzelli, F.; Mangoni, M.; Bianco, S. D.; Orzella, I.; Guzzi, P. H.; Caputo, V.; Biagini, T.; Mazza, T.

2026-07-02 bioinformatics 10.64898/2026.06.27.734992 medRxiv

Top 0.1%

72.3%

The traditional 'one drug, one target' paradigm assumes that drugs interact with a single specific binding site. Modern pharmacology has proven this definition overly simplistic and, instead, recognizes that drugs operate within complex biological systems and often interact with multiple targets. In this context, proteins cannot be viewed as possessing a single functional binding site, but rather as dynamic entities capable of accommodating ligands at multiple regions, including transient and cryptic pockets. Here, we review and repurpose representative pocket detection tools across geometry-based, energy-based, and machine/deep learning approaches, originally designed to work on static conformations, to evaluate their agreement on molecular dynamics-derived conformational ensembles. Using GLUT1 protein as a dynamic transporter model and Aldose reductase as a cryptic-pocket reference system, we combine inter-tool concordance, HDBSCAN-based spatial clustering, volumetric IoU analysis, and temporal persistence scoring. Our results show that different algorithmic classes capture complementary aspects of pocket dynamics, with energy-based methods showing stronger sensitivity to transient cryptic regions and geometry-based approaches depending more strongly on pre-formed cavities. This work proposes a consensus-oriented framework for identifying conserved and transient druggable pockets in dynamic protein systems.
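
Two of the quantities named in this abstract, volumetric IoU and temporal persistence, can be illustrated with a minimal sketch. The pocket representation (a set of integer voxel indices per MD frame), the function names, and the 0.5 overlap threshold are all assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection-over-union of two pockets given as sets of voxel indices."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def temporal_persistence(frames: Iterable[Set[Voxel]], reference: Set[Voxel],
                         iou_threshold: float = 0.5) -> float:
    """Fraction of frames whose pocket overlaps the reference pocket
    with IoU >= iou_threshold (the threshold is an assumed parameter)."""
    frames = list(frames)
    if not frames:
        return 0.0
    hits = sum(1 for pocket in frames
               if voxel_iou(pocket, reference) >= iou_threshold)
    return hits / len(frames)


# Toy data: one reference pocket and four frames of detections.
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
frames = [
    {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)},  # identical pocket
    {(0, 0, 0), (0, 0, 1), (0, 1, 0)},             # three of four voxels
    {(5, 5, 5)},                                   # unrelated cavity
    set(),                                         # pocket closed in this frame
]
persistence = temporal_persistence(frames, reference)
```

In this toy case a pocket that is present in half of the frames gets a persistence of one half; a transient or cryptic site would score low on this measure while still being detectable in individual frames.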

2

StructureSAFE: A structure-aware chemical language model for unified hit identification and lead optimization

Yang, B.; Xu, K.; Xiang, C.; Lee, B.; Xu, Y.; Li, T.; Shi, Y.; Sinitskiy, A.; Li, J.

2026-07-02 bioinformatics 10.64898/2026.06.28.735128 medRxiv

Top 0.1%

65.0%

Structure-based generative models (SBGMs) hold great promise for accelerating drug discovery by enabling target-aware molecular design. However, existing approaches face fundamental challenges: three-dimensional graph-based models can explicitly incorporate protein structural information but often generate chemically implausible molecules because of limited training data, while chemical language models (CLMs) produce chemically plausible molecules but struggle to leverage three-dimensional structural information for structure-conditioned generation and, owing to the nature of SMILES strings, are hard to extend with lead optimization functionality. Here, we present StructureSAFE, a structure-aware chemical language model that resolves this trade-off by integrating protein structural and evolutionary encoders with the SAFE molecular representation via a pretraining and fine-tuning scheme, enabling both de novo hit identification and a comprehensive suite of lead optimization subtasks within a unified framework. Comprehensive benchmarking on the MolGenBench dataset demonstrates that StructureSAFE achieves state-of-the-art (SOTA) performance across multiple metrics, with particularly pronounced improvements in chemical plausibility relative to graph-based models lacking pretraining. Evaluation on a rigorously constructed held-out test set further confirms its ability to generate drug-like, synthetically accessible molecules with competitive predicted binding affinities for previously unseen targets in both hit identification and lead optimization settings. In silico case studies across four therapeutically relevant targets validate its capacity to generate chemically plausible molecules that recapitulate key binding interactions of known high-affinity ligands while proposing novel interactions that may improve affinity and exploring previously unknown regions of chemical space.
Taken together, StructureSAFE is a versatile and practical tool that provides high-quality candidate molecules for augmenting medicinal chemistry workflows in both hit identification and lead optimization campaigns.

3

ConfDock: Atom-specific Uncertainty Quantification for Molecular Docking via Conformal Prediction

Hao, H.; Elhendawy, N.; Wang, Y.; Lu, C.

2026-07-01 biochemistry 10.64898/2026.06.29.735353 medRxiv

Top 0.1%

56.0%

Molecular docking is widely used in structure-based drug discovery, yet most approaches provide point estimates without rigorous uncertainty quantification. This limitation makes it difficult to assess when a predicted pose should be trusted, especially when docking methods are applied to diverse protein-ligand systems. We present ConfDock, a conformal prediction (CP) framework for constructing atom-specific prediction intervals for ligand docking poses. ConfDock combines graph neural network (GNN) based quantile estimation with split conformal calibration, producing intervals that adapt to local protein-ligand environments while retaining distribution-free finite-sample coverage guarantees. We evaluate ConfDock on 238 protein-ligand complexes across four docking methods representing distinct computational paradigms. The proposed approach yields substantially narrower prediction intervals compared to standard split CP (57.2% average reduction in mean interval width, up to 74.5%) while maintaining target coverage across all evaluated settings. Ablation analysis indicates that the GNN captures the dominant structure-dependent variability in uncertainty, whereas the conformal calibration step provides a bounded adjustment to ensure coverage guarantees. These results demonstrate that combining learned, structure-aware quantile estimation with conformal calibration enables rigorous uncertainty quantification for molecular docking at atom-level resolution.
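
The combination of learned quantile estimates with split conformal calibration can be sketched as follows. This is a generic conformalized-quantile-regression recipe, assumed here for illustration; the abstract does not state ConfDock's exact nonconformity score. The function names and the toy per-atom error values are hypothetical, and the lower and upper quantiles would come from the GNN in the described method.

```python
import math
from typing import List, Sequence, Tuple


def conformal_offset(lower: Sequence[float], upper: Sequence[float],
                     truth: Sequence[float], alpha: float) -> float:
    """Split-conformal correction computed on a held-out calibration set.

    Each score measures how far the true value falls outside the predicted
    interval [lower, upper]; it is negative when the value lies inside.
    """
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); if it exceeds n the
    # only interval with guaranteed coverage is unbounded.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return scores[rank - 1]


def calibrated_intervals(lower: Sequence[float], upper: Sequence[float],
                         offset: float) -> List[Tuple[float, float]]:
    """Widen (or shrink, if offset < 0) every predicted interval by offset."""
    return [(lo - offset, hi + offset) for lo, hi in zip(lower, upper)]


# Toy calibration set: predicted quantile bounds and observed values
# (for example, per-atom displacement errors in angstroms).
cal_lower = [0.5] * 9
cal_upper = [1.5] * 9
cal_truth = [1.0, 1.2, 0.8, 1.6, 0.4, 1.1, 0.9, 1.9, 1.0]
offset = conformal_offset(cal_lower, cal_upper, cal_truth, alpha=0.2)

# Apply the same offset to intervals predicted for new atoms.
new_intervals = calibrated_intervals([0.2, 1.0], [0.9, 2.5], offset)
```

Because the offset is a single scalar added to intervals whose width already varies with the input, the learned quantiles carry the structure-dependent part of the uncertainty and the calibration step only supplies a bounded adjustment, which matches the division of labour the abstract reports from its ablation.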

4

BoltzMol-1: Towards Reliable Virtual Screening for Fast and Cost-Effective Hit Discovery

Getz, N.; Smith, G.; Colgan, A.; Fan, V.; Cavalleri, L.; Capponi, F.; Wohlwend, J.; Gitter, A.; Kritzer, J.; Maiorano, M.; Wlodarchak, N.; Corso, G.; Passaro, S.

2026-07-06 biochemistry 10.64898/2026.07.04.736485 medRxiv

Top 0.1%

53.0%

We present BoltzMol-1, a small-molecule hit discovery pipeline, centered on an optimized version of Boltz-2, explicitly adapted for prospective discovery. Reliable hit discovery that generalizes across target classes (rather than only the well-characterized families that dominate existing ligand data) would broaden the range of biology accessible to small-molecule intervention and reduce reliance on resource-intensive high-throughput screening. Towards this goal, the system prioritizes compounds for rapid experimental validation by coupling model-driven ranking with streamlined procurement from commercial catalogs. To improve developability at the point of selection, we introduce a suite of ADMET models for kinetic solubility (logS), lipophilicity (logD), and Caco-2 permeability. These models act as an early triage layer, systematically filtering out compounds with unfavorable physicochemical and absorption properties prior to synthesis or purchase. Across a panel of ten targets (most with no representation in the underlying affinity training data) we observe strong prospective performance on challenging systems. Functional actives or binders were identified for 6 of 10 targets, despite modest experimental budgets of 28-96 compounds per target. These results include successes on receptors and enzymes traditionally considered difficult for structure- or ligand-based approaches. Collectively, this work establishes a practical framework for low-throughput, cost constrained discovery campaigns capable of delivering chemically tractable binders with favorable property profiles.

5

F.A.D.E. (Fully Agentic Drug Engine): A Conversational AI Platform for Drug Discovery

Kantorow, J.; Mani, N.; Mohanraj, N. R.; Zong, X.

2026-06-25 biophysics 10.64898/2026.06.20.733481 medRxiv

Top 0.1%

50.2%

Drug discovery remains one of the costliest and most time-intensive endeavors in the pharmaceutical pipeline, with average development costs exceeding $2.3 billion per drug, timelines spanning more than a decade, and attrition rates above 90% in clinical trials. While computational methods have expanded the searchable chemical space, current pipelines remain fragmented and largely inaccessible to researchers without deep interdisciplinary expertise. Here we present F.A.D.E. (Fully Agentic Drug Engine), a multi-agent, open-source platform that converts natural language queries into potential drug candidates, substantially lowering the expertise barrier to advanced computational drug discovery. F.A.D.E. employs a three-branch hierarchical architecture that adapts to the level of available structural data for any protein target, integrating structure prediction, binding pocket detection, equivariant diffusion-based de novo ligand generation, and binding affinity estimation into a single automated pipeline. We validate F.A.D.E. on two structurally distinct targets: the epidermal growth factor receptor kinase domain (EGFR), a well-established oncology target, and cellular retinol-binding protein 1 (CRBP1), a lipid-binding protein involved in retinoid metabolism. For EGFR, our generated candidates achieved QED scores of 0.85 compared to 0.46 for the co-crystallised reference ligand, demonstrating marked improvement in predicted drug-likeness. Results across both targets confirm that F.A.D.E. can reliably generate chemically tractable, drug-like hit compounds across diverse protein classes from simple natural language input.

6

ADMET Property Prediction with Quantum-Inspired Preprocessing

Mansour, B.; Rafaelyan, G.

2026-07-05 bioinformatics 10.64898/2026.06.30.735582 medRxiv

Top 0.1%

45.4%

Accurate prediction of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties is a central challenge in early-stage drug discovery, where experimental determination remains costly and time-consuming. In this work, we propose a quantum-inspired preprocessing framework in which statistical dependencies among molecular descriptors are encoded into a parameterised many-body Hamiltonian, and the expectation values obtained by simulating its time evolution serve as additional inputs to a gradient-boosted ensemble model (CatBoost). Mutual information (MI) is used both to select the most informative descriptors and to set the coupling strengths of the Hamiltonian, so that the induced entanglement structure reflects empirically measured feature correlations; the evolution is realised with a short digitised-counterdiabatic schedule that generates a compact set of expectation-value features while keeping the circuit shallow. The resulting quantum-derived feature vectors are concatenated with the full MapLight descriptor set, concatenated ECFP, Avalon, and ErG fingerprints together with RDKit physicochemical properties, before training. We evaluate the pipeline on the AqSolDB aqueous solubility benchmark from the Therapeutics Data Commons (TDC) platform, achieving a mean absolute error (MAE) of 0.746 +/- 0.006 log(mol/L), which is within the reported error bars of the current top-performing model on the TDC leaderboard (MAE = 0.741 +/- 0.013). Ablation experiments show that the quantum-derived features match classical second-degree polynomial interaction features derived from the same MI-selected subset, while forming a far more compact representation (85 quantum features versus up to 4,950 polynomial terms, an approximately 58-fold reduction). SHapley Additive exPlanations (SHAP) analysis identifies the physicochemical drivers of solubility predictions, offering interpretable insight into model behaviour. 
These results demonstrate that MI-guided Hamiltonian feature extraction can reproduce the performance of strong classical interaction models on aqueous solubility while generating a compact, interpretable feature representation that is compatible with future quantum execution.

7

MolMAE: A Surface-Centric Multimodal Masked Autoencoder for Molecular Representation Learning

Li, J.

2026-07-14 bioinformatics 10.64898/2026.07.11.737987 medRxiv

Top 0.1%

42.9%

Molecular representation learning has become a central component of modern computational drug discovery. Existing molecular foundation models mainly rely on SMILES strings, two-dimensional molecular graphs, or three-dimensional atomic coordinates. However, many molecular properties are ultimately governed by the molecular surface, where intermolecular recognition, solvation, electrostatic complementarity, and ligand-protein interactions occur. In this work, we propose MolMAE, a surface-guided multimodal masked autoencoder for molecular representation learning. MolMAE takes molecular surface point clouds, three-dimensional molecular graphs, and SMILES-derived fragment and functional-group tokens as complementary input modalities, and learns a unified multimodal molecular embedding through functional-group-aligned masked autoencoding. During pretraining, chemically corresponding local regions are jointly masked across surface, graph, fragment, and functional-group views, forcing the model to reconstruct missing geometric, physicochemical, structural, and semantic information from the remaining context. While molecular surface reconstruction serves as the primary pretraining objective, graph-, fragment-, and functional-group-level reconstruction tasks provide complementary supervision that encourages the model to capture molecular topology, bonding patterns, stereochemistry, local chemical environments, and substructure organization. In addition to reconstructing surface geometry, MolMAE reconstructs surface-associated physicochemical fields, including electrostatic potential and Fukui-related descriptors, enabling the model to learn chemically meaningful surface representations. Pretrained on approximately 261K lead-like bioactive molecules, MolMAE achieves strong performance on the ESOL benchmark under scaffold splitting and competitive performance across multiple molecular property prediction tasks. 
These results suggest that molecular surface-guided pretraining can complement conventional graph-, sequence-, and atom-coordinate-based molecular representations, especially for property prediction tasks influenced by exposed surface geometry and surface-associated physicochemical patterns.

8

A control-validated pan-proteome deep-learning pipeline nominates GPR35 as a candidate target of the orphan bacterial metabolite ligiamycin A

Martin, J.

2026-07-06 bioinformatics 10.64898/2026.07.01.735807 medRxiv

Top 0.1%

41.1%

Most microbial natural products with documented bioactivity lack an identified molecular target, which limits their development. We present an open, control-validated computational pipeline for natural-product target hypothesis generation. It combines a pan-proteome deep-learning drug-target interaction (DTI) model (a graph neural-network ligand encoder, an ESM-2 protein language-model encoder, and bidirectional cross-attention) with bias-corrected ranking and control-anchored molecular docking. Applying it to ligiamycin A, a 2022-described Streptomyces/Achromobacter co-culture decalin-amino-maleimide with no reported target, we find that the predicted interactions of the compound are dominated by class-A G-protein-coupled receptors. Using a drug with a known target (losartan) we identify and correct a frequent-hitter bias in the raw model; after correction the standout candidates are uniformly class-A GPCRs, led by the orphan receptor GPR35. Structure-based docking with matched positive and negative controls across three candidates corroborates GPR35 specifically: ligiamycin A scores comparably to the known GPR35 agonist zaprinast at the agonist pocket (-8.1 vs -8.3 kcal/mol; non-binder floor -5.5), whereas FFAR1 is excluded and histamine H2 is inconclusive. We propose GPR35 as a prioritized, experimentally testable target and release the workflow as a reusable tool. The result is a computational hypothesis that requires experimental validation.

9

PEPstrMOD2: Next-generation tertiary structure prediction of chemically modified and non-natural peptides

Jain, S.; Mehta, N. K.; Raina, S.; Kumar, P.; Varun; Raghava, G. P. S.

2026-07-06 bioinformatics 10.64898/2026.06.22.733733 medRxiv

Top 0.1%

38.4%

While most existing methods are limited to predicting the tertiary structures of proteins containing only canonical residues, the PEPstrMOD server (developed in 2015) pioneered structure prediction for chemically modified and non-natural peptides. Despite its widespread use, the original framework was restricted to peptides of 7 to 25 residues and relied on older backbone-prediction algorithms. To address these limitations, we present PEPstrMOD2, which introduces three major advancements over its predecessor. First, it replaces the original in-house coordinate generation with state-of-the-art deep learning (DL) algorithms, leveraging AlphaFold2 and ESMFold for highly accurate initial structure prediction. Second, it greatly expands the accessible chemical space by incorporating a new, AMBER force-field-compatible library of 257 post-translational modifications (PTMs), 428 non-canonical amino acids (NCAAs), and 243 terminal modifications. Lastly, by exploiting the native scalability of AlphaFold2 (AF2) and ESMFold (EF), PEPstrMOD2 removes the original length restrictions, enabling the structural modeling of longer, complex therapeutic peptides and small proteins. We evaluated the performance of PEPstrMOD2 against state-of-the-art methods across three distinct peptide datasets. For the AfCyc dataset of 80 cyclic peptides, PEPstrMOD2 obtained a competitive average atom-level Root Mean Square Deviation (RMSD) of 2.05 angstroms, compared with 1.13 angstroms for AlphaFold3 (AF3) and 1.82 angstroms for AfCycDesign. For the ModPep433 dataset of modified peptides, PEPstrMOD2 outperformed AF3, achieving a lower average RMSD of 4.49 angstroms versus 4.67 angstroms. Furthermore, on the ModPep16 benchmark, PEPstrMOD2 achieved an average RMSD of 2.50 angstroms, less than half that of the original PEPstrMOD (5.84 angstroms).
In summary, PEPstrMOD2 provides a powerful, high-throughput, and highly accurate platform to facilitate peptide-based drug development and structural biology research. While the original PEPstrMOD was restricted to a web server interface, PEPstrMOD2 is available as both an intuitive webserver and a standalone command-line tool via GitHub, featuring Docker support for easy deployment and reproducible, large-scale modeling pipelines (https://webs.iiitd.edu.in/raghava/pepstrmod/).

10

BATTLE-AMP: Benchmarking Antimicrobial Peptide Predictors

Szymczak, P.; Bukała, A.; Zarzecki, W.; Sala, M.; Borisek, J.; Fadavi, S.; Olayo-Alarcon, R.; Sroka, J.; Colome-Tatche, M.; Gambin, A.; Müller, C. L.; Setny, P.; Szczurek, E.

2026-06-24 bioinformatics 10.64898/2026.06.19.733349 medRxiv

Top 0.1%

35.0%

As antimicrobial resistance outpaces antibiotic development, antimicrobial peptides (AMPs) have emerged as a promising class of alternative antibacterials, and computational predictors are increasingly used to prioritize AMP candidates. Such predictors are typically evaluated on binary AMP/non-AMP classification, which does not test whether they can identify peptides with clinically relevant potency against specific pathogens. We present BATTLE-AMP, a benchmarking framework that evaluates AMP predictors against experimentally measured minimum inhibitory concentrations (MICs) across clinically relevant bacterial species and strains. We surveyed 48 published methods, finding fewer than 25% reproducible, and benchmarked 10 model families (21 variants) using experimental MIC data, synthetic sequence perturbations, activity cliff analyses, and all-atom molecular dynamics (MD) simulations. Four findings emerge: (i) models trained on MIC data outperform binary classifiers regardless of architecture; (ii) the best model depends on the target pathogen, so model selection must be guided by the biological question; (iii) most models cannot distinguish active peptides from inactive sequences with identical amino acid composition; and (iv) activity cliffs remain unresolved by both machine learning and MD, marking a limit of current computational methods. BATTLE-AMP is released as an open Snakemake framework at https://github.com/szczurek-lab/battleamp-snakemake for benchmarking new models and scoring novel candidate libraries.
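
Finding (iii) concerns sequences that share an amino acid composition with an active peptide. One simple way to build such a control is to shuffle the residues, as in the sketch below. The function name and the toy sequence are hypothetical, and the abstract does not say how BATTLE-AMP generates its synthetic perturbations, so this is only one plausible construction.

```python
import random
from collections import Counter


def shuffled_decoy(sequence: str, rng: random.Random,
                   max_tries: int = 100) -> str:
    """Return a permutation of `sequence` that differs from it when possible.

    The decoy keeps the exact amino acid composition, so any
    composition-only feature (length, net charge, residue counts)
    is identical to that of the original peptide.
    """
    residues = list(sequence)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != sequence:
            return candidate
    return sequence  # e.g. a homopolymer has no distinct permutation


rng = random.Random(0)            # fixed seed for reproducibility
peptide = "KLAKLAKKLAKLAK"        # toy sequence used only for illustration
decoy = shuffled_decoy(peptide, rng)
```

A predictor that assigns the same score to `peptide` and `decoy` is relying on composition rather than residue order, which is the failure mode the benchmark reports for most models.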

11

Molecular dynamics simulations demonstrate reduced antibiotic affinity to mirror bacterial targets

Fady, P.-E.; Ciccone, J.

2026-07-15 molecular biology 10.64898/2026.07.14.738450 medRxiv

Top 0.1%

34.9%

"Mirror life", self-replicating organisms composed of non-natural-chirality biomacromolecules, presents a future threat with potentially global consequences. Consequently, there is strong agreement among experts that it should not be created. However, there is some disagreement over how effective existing medical countermeasures might prove against mirror bacteria in the event that they were created. Here, we leverage computational chemistry methods, including docking and molecular dynamics, to determine the likely binding efficacy of existing antibiotics against natural and mirror bacterial protein targets. We find that most existing antibiotics fail to bind to mirror bacterial protein targets, unlike their natural-chirality targets. This suggests altered binding of current medical countermeasures, which may reduce their antimicrobial activity against mirror bacteria were the latter to be created.

12

BBBP_Atlas: Unified Interpretable Modeling of Blood Brain Barrier Permeability across Small Molecules and Peptides

Shen, X.; Su, Q.; Luo, H.; Gou, Q.; Ge, J.; Hou, T.; Wang, J.; Kang, Y.

2026-07-09 bioinformatics 10.64898/2026.07.06.736742 medRxiv

Top 0.1%

32.1%

Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system drug discovery, yet existing models are often limited by their reliance on predefined physicochemical descriptors, small-molecule-centered training sets, or conformation-dependent representations, which restricts their transferability across chemically diverse modalities, especially peptides. In addition, publicly available BBBP datasets remain fragmented, inconsistently standardized, and weakly controlled for molecular redundancy, increasing the risk of data leakage and overestimated model performance. In this study, we propose BBBP-Atlas, a structure-aware BBB permeability prediction model designed for unified modeling of small molecules and peptides, together with OmniBBBP, the first cross-modal dataset. Designed to bypass descriptor and conformation dependencies, our model represents standardized molecular structures as atom-level graphs to capture local atom-bond environments and long-range topological dependencies associated with BBB transport. This design enables direct learning of structure-permeability relationships from molecular topology. For model training and evaluation, we curated OmniBBBP, a cross-modal, redundancy-filtered database that unifies small molecules and complex peptides and contains 10,218 unique compounds (9,316 small molecules and 902 peptides). BBBP-Atlas achieved an accuracy of 0.8914 and an MCC of 0.7678 on the independent test set. On a balanced external benchmark of 200 compounds, our model reached an AUC of 0.9108, an accuracy of 0.8500, and an MCC of 0.7000, outperforming LightBBB by an absolute MCC gain of 6%. Case studies further showed that BBBP-Atlas captured clinically meaningful BBB permeability patterns, correctly identifying lorlatinib as BBB-permeable and vancomycin as BBB-impermeable with high confidence.
The OmniBBBP-backed BBBP-Atlas offers a versatile and cross-modal approach for single-compound prediction, batch screening, and dataset exploration for CNS drug discovery. BBBP-Atlas is available at https://cadd.drugflow.com/bbbp/.
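
For readers unfamiliar with the Matthews correlation coefficient (MCC) quoted above, the sketch below computes accuracy and MCC from a confusion matrix. The counts are hypothetical: they are one confusion matrix on a balanced 200-compound set that happens to reproduce the reported accuracy of 0.85 and MCC of 0.70, not the authors' actual results table.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified compounds."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 100 permeable and 100 impermeable compounds,
# with 15 errors in each class.
tp, tn, fp, fn = 85, 85, 15, 15
acc_value = accuracy(tp, tn, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)
```

On a balanced set with symmetric errors, MCC equals 2 × accuracy − 1, which is why an accuracy of 0.85 and an MCC of 0.70 appear together.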

13

Computational Lead Optimization on BACE1: Relative Binding Free Energy Perturbation as the Terminal Refinement Layer

Alejo, K.; Korban, C.; Chung, C.

2026-07-08 biochemistry 10.64898/2026.07.07.737131 medRxiv

Top 0.1%

30.8%

Structure-based drug discovery typically applies computational methods in a tiered hierarchy, with each layer narrowing the candidate set and refining the binding picture before committing to the next, more expensive step. We present a four-tiered computational benchmarking study evaluating five engines against a panel of 36 compounds targeting β-secretase 1 (BACE1), a validated Alzheimer's disease target with extensive co-crystal ground truth. The study evaluates Flexible Docking and Boltz2 Cofolding as the primary tier, followed by Ensemble Docking, then Protein-Ligand MD with MM/PBSA and MM/GBSA post-processing, and finally Relative Binding Free Energy Perturbation (RevFEP) as the terminal refinement layer. Each method was benchmarked against experimental binding free energies for the co-crystallized ligands, spanning -7.85 to -11.35 kcal/mol. Flexible Docking reproduced the co-crystal binding mode for 35 of 36 ligands (97.2% within 2.0 Å RMSD) but did not rank potency at this resolution. Boltz2 Cofolding provided an orthogonal structural cross-check, with a receptor backbone RMSD of 0.293 Å against the experimental co-crystal structure. Ensemble Docking identified the optimal receptor conformation for downstream FEP setup. MD with MM/GBSA decomposition identified van der Waals complementarity as the primary potency driver (Pearson r = +0.855, R² = 0.732 on a 10-compound subset). RevFEP delivered the highest affinity correlation of any method (Pearson r = +0.662, R² = 0.438, Spearman ρ = +0.624, mean absolute error 1.02 kcal/mol across all 36 ligands), resolving potency differences within a narrow 3.5 kcal/mol congeneric window that no other engine could discriminate. We characterize what each engine contributes independently and where RevFEP delivers signals no other engine achieves.
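
The ranking statistics quoted above (Pearson r, Spearman ρ, mean absolute error) can be computed as in the following sketch. The four energy pairs are invented toy values, not data from the study, and the helper names are hypothetical. The reported R² values are approximately the squares of the corresponding Pearson r values (0.662² ≈ 0.438).

```python
import math
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def mean_absolute_error(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy binding free energies in kcal/mol (illustrative only).
experimental = [-8.0, -9.0, -10.0, -11.0]
predicted = [-8.5, -8.7, -10.4, -10.6]

r = pearson_r(experimental, predicted)
rho = spearman_rho(experimental, predicted)
mae = mean_absolute_error(experimental, predicted)
r_squared = r ** 2
```

Spearman ρ depends only on the ordering of the compounds, so a method can rank a congeneric series perfectly (ρ = 1 in this toy case) while still carrying a non-zero absolute error.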

14

Improving Generalizability in Whole-Cell Antibiotic Discovery Through Active Learning

Serrano, L. R.; Zhou, A.; Wei, Z.; Stocks, K.-L. K.; Ektefaie, Y.; Gwynne, P. J.; Chen, E.; Krieger, I.; Sacchettini, J.; Aldridge, B.; Hu, L. T.; Farhat, M. R.

2026-07-05 bioinformatics 10.64898/2026.07.04.736489 medRxiv

Top 0.1%

30.7%

Machine learning (ML) has accelerated molecular discovery, yet training models to generalize to out-of-distribution (OOD) chemical spaces remains fundamentally constrained by the high cost of experimental validation. In antibiotic discovery, where whole-cell phenotypic high-throughput screening (HTS) is resource-intensive, iterative ML-guided compound selection, or Active Learning (AL), offers a pathway to efficiently navigate available chemical spaces. However, the algorithmic tradeoffs between prioritizing compound novelty (exploration) and predicted bioactivity (exploitation), and their impact on OOD generalizability, remain unresolved for noisy, whole-cell biological systems. In this work, we systematically evaluate three AL strategies for whole-cell bacterial bioactivity and benchmark their effects on model accuracy, hit rate, and OOD performance. Using retrospective simulations on Mycobacterium tuberculosis HTS data, we identify an optimal AL strategy that balances predicted hit/non-hit novelty with overall hit rate. We then integrate the strategy in a closed-loop Borrelia burgdorferi antibiotic discovery HTS campaign. The AL-guided approach increased the experimental screening hit rate five-fold (from a 0.2% rate within investigator-selected plates to 1.0%). Further, when the trained model was applied in prospective in silico selection of highly diverse compounds across multiple bacterial species, the AL-trained whole-cell inhibition predictor demonstrated 53-fold enrichment over investigator-directed screening (11.0% experimental validation of predicted hits). Of these, 100% demonstrated the intended narrow-spectrum activity for Borrelia burgdorferi. These results demonstrate that calibrated AL strategies can overcome data acquisition bottlenecks and train generalizable property predictors able to extrapolate to OOD molecules.
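
The exploration/exploitation tradeoff described above can be made concrete with a small batch-selection sketch. Everything here is an assumption for illustration: the abstract does not specify the three AL strategies, so the linear weighting, the Tanimoto-based novelty term, the set-of-integers fingerprints, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two fingerprints stored as bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novelty(fp: Fingerprint, screened: Sequence[Fingerprint]) -> float:
    """1 minus the highest similarity to any already-screened compound."""
    if not screened:
        return 1.0
    return 1.0 - max(tanimoto(fp, s) for s in screened)


def select_batch(candidates: Dict[str, Fingerprint],
                 predicted: Dict[str, float],
                 screened: Sequence[Fingerprint],
                 batch_size: int,
                 explore_weight: float) -> List[str]:
    """Rank candidates by a weighted mix of predicted activity and novelty.

    explore_weight = 0 is pure exploitation; 1 is pure exploration.
    """
    scores = {
        name: (1 - explore_weight) * predicted[name]
        + explore_weight * novelty(fp, screened)
        for name, fp in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]


# Toy library: A resembles what was already screened, C is unlike it.
screened = [frozenset({1, 2, 3})]
candidates = {
    "A": frozenset({1, 2, 3}),
    "B": frozenset({1, 2, 4}),
    "C": frozenset({7, 8, 9}),
}
predicted = {"A": 0.9, "B": 0.6, "C": 0.2}  # model-predicted hit probability

exploit = select_batch(candidates, predicted, screened, 1, explore_weight=0.0)
explore = select_batch(candidates, predicted, screened, 1, explore_weight=1.0)
mixed = select_batch(candidates, predicted, screened, 2, explore_weight=0.5)
```

Pure exploitation keeps resampling familiar chemistry (compound A), while any weight on novelty pulls dissimilar compounds into the batch, which is the mechanism by which an AL loop can improve coverage of out-of-distribution chemical space.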

15

Distillation enables scalable high-fidelity virtual screening across ultra-large chemical libraries

Dai, J.; Wang, Y.; Shan, N. L.; Mariani, M.; Yu, Z.; Yan, Q.; Golani, L. K.; Surovtseva, Y. V.; Lee, W. H.; Pusztai, L.

2026-07-03 bioinformatics 10.64898/2026.06.29.735361 medRxiv

Top 0.1%

30.6%

Accurate virtual screening of ultra-large chemical libraries remains challenging. Existing approaches rely on lower-fidelity scoring functions or sampling-based strategies that can limit predictive accuracy and bias the exploration of chemical space. Here, we present FastBindRank, a distillation-based framework that transfers the predictive power of the structure-based model Boltz-2 into an efficient sequence-based surrogate. Trained on ~1% of the 122-million-compound PubChem library, FastBindRank enables high-fidelity screening at scale. Applied to histone deacetylase 11 (HDAC11), FastBindRank substantially enriched high-confidence binders relative to the background chemical space. The lightweight model captured structural patterns associated with predicted binding, revealing structural determinants of binding. Under a comparable computational budget, FastBindRank achieved a 74-fold increase in hit rate and over a 30-fold increase in discovery yield over direct subset-based screening. Experimental validation confirmed the activity of two novel compounds. These results establish distillation as a practical strategy for scalable, high-fidelity virtual screening of ultra-large chemical libraries.

16

Desktop-Scale Hit-Point Discovery for Intrinsically Disordered α-Synuclein Using State-Space Compression and a Discrete Phase-Interference Search Operator

Kim, D. H.; Khenmedekh, G.-O.; Park, i.; Kim, S.

2026-06-28 bioinformatics 10.64898/2026.06.22.733879 medRxiv

Top 0.1%

30.6%

The accessible chemical space dwarfs any tractable screening budget, and most artificial intelligence drug discovery pipelines respond by docking and ranking a small sublibrary. The resulting hit list is agnostic to selectivity, brain penetration, toxicity, synthetic accessibility, and chemical novelty. We present ISTP-DPISO DrugEngine, an end-to-end engine developed by ISTP Tech that integrates the Local Information Criticality Principle (LICP) with a Discrete Phase-Interference Search Operator (DPISO). We demonstrate the engine on the intrinsically disordered protein (IDP) α-synuclein, whose non-amyloid-component (NAC, residues 61-95) drives Parkinson-associated aggregation. The resulting LICP active set focuses the expensive LICP-DPISO scoring: in a production-scale run, the engine compressed a ~8.46 × 10^8-molecule mirror to a 10,000,000-molecule active set (~85-fold) before scoring, then converged to a compact, safety-gated shortlist plus de novo designs. The entire campaign ran on a single desktop workstation, without any high-performance-computing cluster. Three engine-prioritized, commercially available candidates (2-D08, Uralenol, Herbacetin) and an (-)-epigallocatechin gallate (EGCG) positive control were then tested in a thioflavin-T (ThT) aggregation assay at 100 µM: all three engine-nominated candidates suppressed α-synuclein aggregation, giving perfect prospective inhibitor-call concordance (3/3 nominated); together with the EGCG positive control, all four assayed compounds inhibited aggregation (4/4 total), two by ≤80% plateau reduction. ISTP-DPISO DrugEngine reframes virtual screening from post-hoc score fusion to a single, state-space-compressed, safety-gated, experimentally validated discovery pipeline.

17

Collinearity of Decomposed Energy Terms in MM-GBSA Binding Free Energy Calculations

Sevim, A.; Kocak, A.

2026-06-29 biophysics 10.64898/2026.06.24.734195 medRxiv

Top 0.1%

30.2%

The molecular mechanics-generalized Born surface area (MMGBSA) method is one of the most commonly used end-state approaches for calculating binding free energies in computational drug design and screening studies. It is customary to break up the free energy into van der Waals, electrostatic, polar solvation (GB), and nonpolar solvation (SA) terms and then either correlate these terms with experiment or assign physical meaning to each term. Here, we demonstrate that the underlying assumption of independent fitting coefficients for the decomposed energy terms can be invalid. Through analytic derivation and large-scale molecular dynamics simulations, we show that (i) the protein-ligand Coulomb interaction energy and the GB solvation correction are almost perfectly collinear (R² ≥ 0.99), reflecting their designed role as vacuum electrostatics plus solvent screening, and (ii) the van der Waals interaction and SA term likewise exhibit strong correlation, as both depend primarily on buried surface area. Interaction entropy and C2 entropy corrections are also found to be strongly dependent on underlying electrostatic fluctuations, further reinforcing redundancy. These findings hold both at the level of instantaneous trajectory fluctuations and when averaged across a diverse set of 139 protein-protein complexes, and they persist in both single-trajectory and three-trajectory MMGBSA protocols. Our results caution against using decomposed MMGBSA terms as independent predictors in regression models and suggest instead combining correlated terms into effective polar, nonpolar, and entropic contributions. Our study provides a systematic diagnosis of collinearity in MMGBSA and highlights pathways toward more interpretable and statistically robust predictive modeling.
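The collinearity check described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the per-frame energies are synthetic, the 90% screening factor and noise level are invented, and `pearson_r` and `stdev` are hypothetical helper names. It is not the authors' protocol, only a small instance of measuring R² between two decomposed terms and merging them into one effective polar term.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def stdev(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


rng = random.Random(0)

# SYNTHETIC per-frame energies in kcal/mol, invented for illustration only.
coulomb = [rng.gauss(-200.0, 50.0) for _ in range(500)]
# Assumed toy relation: GB solvation cancels about 90% of the Coulomb term.
gb = [-0.9 * e + rng.gauss(0.0, 2.0) for e in coulomb]

r = pearson_r(coulomb, gb)
r_squared = r * r

# Merge the two collinear terms into a single effective polar contribution.
polar = [c + g for c, g in zip(coulomb, gb)]

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"spread: Coulomb {stdev(coulomb):.1f}, GB {stdev(gb):.1f}, polar {stdev(polar):.1f}")
```

In this toy setting the two raw terms are strongly anticorrelated and each fluctuates far more than their sum, which is why fitting separate regression coefficients to them is poorly conditioned while the combined term is a stable predictor.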

18

Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors

Nael, M.; Alakonda, L.; Ghosh, A.; Ward, S. J.; Liu-Chen, L.-Y.; Rajadhyaksha, A. M.; Abou-Gharbia, M.; Elokely, K. M.

2026-06-24 bioinformatics 10.64898/2026.06.18.732083 medRxiv

Top 0.2%

27.0%

Public bioactivity databases are heterogeneous not only in measurement type, where binding affinities and functional potencies are reported on different scales, but in pharmacology: the same compound and target can carry agonist, antagonist, or inhibitor records measured through binding displacement, cAMP, β-arrestin, or [³⁵S]GTPγS readouts that quantify different biological events. Pooling these records produces models whose output is detached from any coherent pharmacological claim. Prior work has standardized bioactivity at scale and quantified the noise from mixing measurement types, but pharmacological mechanism and assay-readout class have not been treated as a primary axis of large-scale curation. This study presents an auditable, OECD-anchored framework that stratifies public records by action type and assay readout before modeling, converting heterogeneous data into externally validated, interpretable QSAR tasks that compose with existing standardization resources rather than replacing them. The framework is demonstrated on the four opioid receptors (MOR, DOR, KOR, and nociceptin/orphanin FQ, NOP). Four public sources were reconciled into 72,148 merged records and 50,977 curated measurements spanning 19,585 compounds, each carrying auditable attributes for source agreement, endpoint meaning, pharmacology class, assay readout, and trust tier. Receptor-level binding tasks formed a compact benchmark with strong locked external performance, including KOR pK (R² = 0.79, n = 798) and DOR pK (R² = 0.77, n = 736). Pharmacology- and readout-resolved functional endpoints yielded externally validated strata that pooled labels would obscure, including a MOR antagonist functional-inhibition endpoint (R² = 0.86, n = 110) and agonist potency endpoints for DOR, KOR, and MOR (R² up to 0.81).
Comparison against a fully pooled baseline shows that pooled models either match stratified models on coherent endpoints or reach a deceptively high R² on functional-IC50 endpoints by training predominantly on binding-displacement records, so the pooled number predicts affinity rather than functional activity. SHAP attribution indicates that binding and functional potency encode partially distinct structure-activity signals. The dataset contract, not model performance alone, defines the validity and scope of a QSAR claim, and stratification is a precondition for a functional model to support a defensible claim. Curation logic, derived tables, frozen data, and reproducibility artifacts are released.
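The stratify-before-modeling idea can be sketched in a few lines. The record fields (`target`, `action`, `readout`, `pvalue`), the example values, and the `stratify` helper are all hypothetical and do not reflect the released schema or data. The sketch only shows how grouping by target, action type, and assay readout separates measurements that a pooled table would mix.

```python
from collections import defaultdict

# HYPOTHETICAL records: field names and values are illustrative only.
records = [
    {"target": "MOR", "action": "agonist", "readout": "cAMP", "pvalue": 8.1},
    {"target": "MOR", "action": "agonist", "readout": "cAMP", "pvalue": 7.6},
    {"target": "MOR", "action": "agonist", "readout": "beta-arrestin", "pvalue": 6.5},
    {"target": "MOR", "action": "antagonist", "readout": "GTPgammaS", "pvalue": 6.9},
    {"target": "MOR", "action": "inhibitor", "readout": "binding displacement", "pvalue": 9.0},
    {"target": "KOR", "action": "inhibitor", "readout": "binding displacement", "pvalue": 8.4},
]


def stratify(rows):
    """Group potency values by (target, action type, assay readout)."""
    strata = defaultdict(list)
    for row in rows:
        key = (row["target"], row["action"], row["readout"])
        strata[key].append(row["pvalue"])
    return dict(strata)


strata = stratify(records)

# A pooled task would train on every value at once, regardless of what it measures.
pooled = [row["pvalue"] for row in records]

for key, values in sorted(strata.items()):
    print(key, values)
print("pooled size:", len(pooled), "number of strata:", len(strata))
```

Each stratum would then become its own modeling task, so a model trained on the MOR agonist cAMP stratum never sees binding-displacement values labeled as if they were functional potencies.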

19

BoltzProt-1: Towards Efficient De Novo Binder Design with Good Developability

Ucar, T.; Bates, J.; Fu, Y.; Shi, W.; Stark, H.; Nava, D.; Cavalleri, L.; Wohlwend, J.; Corso, G.; Passaro, S.

2026-06-27 bioinformatics 10.64898/2026.06.23.733997 medRxiv

Top 0.2%

26.0%

Designing binders against novel protein targets remains a central challenge in computational drug discovery. Here we introduce BoltzProt-1, a pipeline for generating protein binders, including nanobodies, with improved hit rates and favorable developability properties. At its core lie a refined iteration of BoltzGen's generative model and a novel protein-protein interaction prediction model, BoltzPPI. Employing BoltzPPI instead of BoltzGen's standard structure-prediction confidence metrics to rank nanobody (VHH) designs increases the confirmed-binder hit rate from 3.3% to 8.0% across 10 novel targets. Assessed on 10 additional targets used in prior literature, the BoltzProt-1 pipeline obtains nanobody screening hits for 7 of 10 targets, surpassing the 6 of 10 previously reported by Chai-2. Finally, evaluating the developability of BoltzProt-1-designed nanobodies in terms of stability, aggregation, purity, polyspecificity and hydrophobicity reveals that 58% of its confirmed binders pass every criterion, exceeding both BoltzGen (40%) and clinical-stage VHH controls (21%).

20

AI-guided discovery for low-resource peptide engineering using evolutionary scale modeling

Andrekson, L.; Rydbergh, R.; Mercado, R.; Wenzel, M.

2026-07-01 bioinformatics 10.64898/2026.06.25.734678 medRxiv

Top 0.2%

22.4%

Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering. Yet, it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20-500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. SCARSE significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while comparable performance was achieved on shorter peptide non-mutant datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies. Notably, we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning end-point performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.
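For readers unfamiliar with the metric, a cross-validated R² on a 50-sample dataset can be computed as in the sketch below. It is not SCARSE and does not use its API: a univariate least-squares line stands in for Gaussian process regression, a single synthetic feature stands in for ESM-2 embeddings, and the helper names `fit_line`, `r2_score`, and `cv_r2` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cv_r2(xs, ys, k=5, seed=0):
    """R^2 of pooled out-of-fold predictions from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = slope * xs[i] + intercept  # predict only unseen samples
    return r2_score(ys, preds)


rng = random.Random(1)
# SYNTHETIC stand-in for 50 labeled peptides: one feature, noisy linear label.
xs = [rng.uniform(0.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

score = cv_r2(xs, ys)
print(f"5-fold CV R^2 on 50 samples: {score:.3f}")
```

Because every prediction comes from a model that never saw that sample, the score estimates generalization rather than fit quality, which is the property that makes it usable as an early suitability check.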